Best Practices & Common Pitfalls in Machine Learning
Data Biases & Sampling Issues
- Sample bias: Occurs when the sample is not representative of the population.
- Selection bias: Certain kinds of data are systematically included or excluded.
- Response bias: Labels or annotations are systematically influenced (e.g., survey bias).
- Real-world business examples:
- activity bias (social-media content)
- societal bias (human-generated content)
- selection bias (the model's own outputs re-enter the training data, forming a feedback loop)
Data Drift & Distribution Shift
- covariate drift:
- the distribution of the input (independent) variables, P(X), shifts while P(Y|X) stays the same
- label/prior drift:
- the distribution of the target variable, P(Y), shifts (the output distribution changes but, for a given output, the input distribution P(X|Y) stays the same)
- concept/posterior drift:
- the relationship between inputs and labels changes: the input distribution P(X) remains the same but the conditional distribution of the output given an input, P(Y|X), changes
- general data distribution shifts
- feature definition change
- label schema change
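A common way to monitor covariate drift in a single feature is a two-sample test between training-time and live values. A minimal sketch using `scipy.stats.ks_2samp` (the synthetic feature arrays and the 1% significance threshold are illustrative assumptions, not a universal rule):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=0.0, scale=1.0, size=1000)   # feature values at training time
production = rng.normal(loc=0.5, scale=1.0, size=1000)  # shifted feature values in production

# Kolmogorov-Smirnov test: are the two samples drawn from the same distribution?
stat, p_value = ks_2samp(reference, production)
drift_detected = p_value < 0.01  # reject "same distribution" at the 1% level
```

In practice this would run per feature on a schedule, with the threshold tuned to an acceptable false-alarm rate.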
Statistical Pitfalls
- Endogeneity
- = a situation where a predictor variable is correlated with the error term in a statistical model
- e.g. "the rich get richer, the poor get poorer": the outcome feeds back into the predictors
- Correlation vs. Causation
- Multicollinearity
- Underfitting vs. Overfitting
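Multicollinearity can be quantified with the variance inflation factor (VIF): regress each feature on the others and compute 1/(1 − R²); a common rule of thumb flags VIF > 10. A NumPy-only sketch (the toy data and the threshold are illustrative assumptions):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (n_samples, n_features)."""
    vifs = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add an intercept column
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)     # regress feature j on the rest
        residuals = y - A @ beta
        r_squared = 1.0 - residuals.var() / y.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)  # nearly collinear with x1
x3 = rng.normal(size=500)                  # independent feature
X = np.column_stack([x1, x2, x3])
vifs = vif(X)  # large for x1 and x2, near 1 for x3
```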
Imbalanced Datasets
Why it matters
Accuracy becomes misleading when classes are imbalanced.
Solutions
Choose better metrics
- Choose other metrics for classification problems, such as precision, recall, F-score, balanced accuracy
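To see why accuracy misleads, consider a 90:10 dataset and a classifier that always predicts the majority class. A sketch with scikit-learn's metric functions (the data is made up for illustration):

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, balanced_accuracy_score)

# 90:10 imbalance; the "classifier" always predicts the majority class 0
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100

accuracy = accuracy_score(y_true, y_pred)                    # 0.90, looks great
precision = precision_score(y_true, y_pred, zero_division=0) # 0.0 on the minority class
recall = recall_score(y_true, y_pred, zero_division=0)       # 0.0 on the minority class
f1 = f1_score(y_true, y_pred, zero_division=0)               # 0.0
balanced = balanced_accuracy_score(y_true, y_pred)           # 0.50, no better than chance
```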
Data-level methods: Resampling
- Undersampling: down-sampling the larger set
- by randomly throwing away some data from that set
- Oversampling: up-sampling the smaller set
- Direct duplication: making multiple copies of the data points in the smaller set (can cause the model to overfit)
- by using synthetic data creation such as:
- synthetic minority oversampling technique (SMOTE)
- use the existing data in the smaller set to create new data points that look like the existing ones: use the feature vectors of the minority class to generate synthetic data points that lie between real data points and their k-nearest neighbours
- adaptive synthetic sampling method (ADASYN)
- like SMOTE, but generates more synthetic points for minority samples that are harder to learn (those surrounded by more majority-class neighbours)
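The SMOTE idea above can be sketched in plain NumPy: interpolate between a minority point and one of its k nearest minority neighbours. This is a minimal illustration, not the full algorithm (production code would normally use a library such as `imbalanced-learn`; the toy minority cluster is an assumption):

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between minority points and their k-NN."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(minority)
    # pairwise distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]   # k nearest minority neighbours
    new_points = []
    for _ in range(n_new):
        i = rng.integers(n)                     # pick a random minority point...
        j = neighbours[i, rng.integers(k)]      # ...and one of its neighbours
        u = rng.random()                        # interpolation factor in [0, 1)
        new_points.append(minority[i] + u * (minority[j] - minority[i]))
    return np.array(new_points)

rng = np.random.default_rng(0)
minority = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(20, 2))
synthetic = smote_sample(minority, n_new=80, k=5, rng=rng)
```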
- Algorithm-level methods
- keep the training data distribution intact but alter the algorithm to make it more robust to class imbalance
- use ensemble learning methods, since each model in the ensemble can be trained on a different subset of the data
- cost-sensitive learning
- class-balanced loss
- focal loss
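In practice, cost-sensitive learning is often just a class-weight argument. A sketch using scikit-learn's `class_weight='balanced'`, which weights each class inversely to its frequency (the synthetic 95:5 data is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# 95:5 imbalance in one overlapping feature
X_maj = rng.normal(-1.0, 1.0, size=(950, 1))
X_min = rng.normal(+1.0, 1.0, size=(50, 1))
X = np.vstack([X_maj, X_min])
y = np.array([0] * 950 + [1] * 50)

plain = LogisticRegression().fit(X, y)
# 'balanced' weights errors on the rare class ~19x more here,
# pushing the decision boundary back toward the minority class
weighted = LogisticRegression(class_weight="balanced").fit(X, y)

minority_recall_plain = (plain.predict(X_min) == 1).mean()
minority_recall_weighted = (weighted.predict(X_min) == 1).mean()
```

The weighted model trades some majority-class accuracy for much higher minority recall.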
Use metrics to measure imbalance
- Class imbalance: the imbalance in the number of members between different facet values
- Difference in proportions of labels (DPL): the imbalance of positive outcomes between different facet values
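Both metrics follow directly from their definitions: CI = (n_a − n_d)/(n_a + n_d) for two facet values, and DPL = q_a − q_d where q is the proportion of positive labels within each facet. A sketch (the facet values and labels below are made up):

```python
import numpy as np

def class_imbalance(facet, value_a, value_d):
    """CI = (n_a - n_d) / (n_a + n_d): how unevenly two facet values are represented."""
    n_a = np.sum(facet == value_a)
    n_d = np.sum(facet == value_d)
    return (n_a - n_d) / (n_a + n_d)

def dpl(labels, facet, value_a, value_d, positive=1):
    """DPL = q_a - q_d: difference in positive-label proportions between facet values."""
    q_a = np.mean(labels[facet == value_a] == positive)
    q_d = np.mean(labels[facet == value_d] == positive)
    return q_a - q_d

facet = np.array(["m"] * 80 + ["f"] * 20)
labels = np.array([1] * 60 + [0] * 20 + [1] * 5 + [0] * 15)
ci = class_imbalance(facet, "m", "f")  # (80 - 20) / 100 = 0.6
gap = dpl(labels, facet, "m", "f")     # 0.75 - 0.25 = 0.5
```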
Data Labeling & Label Quality
- correctly labeled datasets are often called "ground truth"
- efficient data labeling
- access to additional human workforces: Machine Learning Systems Design#Human-in-the-Loop Pipelines
- automated data labeling capabilities
- assistive labeling features
- label multiplicity/ambiguity
- multiple annotators/data sources cause conflicting labels
- solutions: majority vote, soft labels, probabilistic labels, annotator modeling
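The first two resolution strategies can be sketched in a few lines of standard-library Python (the class names are illustrative):

```python
from collections import Counter

def majority_vote(annotations):
    """Hard label: the most common annotation wins (ties broken arbitrarily)."""
    return Counter(annotations).most_common(1)[0][0]

def soft_label(annotations, classes):
    """Soft label: the empirical distribution over classes, usable as a training target."""
    counts = Counter(annotations)
    total = len(annotations)
    return {c: counts.get(c, 0) / total for c in classes}

votes = ["cat", "cat", "dog"]          # three annotators disagree
hard = majority_vote(votes)            # "cat"
soft = soft_label(votes, ["cat", "dog"])  # {"cat": 2/3, "dog": 1/3}
```

Soft labels keep the disagreement signal that majority voting throws away, which can help when ambiguity is genuine rather than annotator error.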
Feature Generalization
Always consider two aspects with regard to generalization:
- feature coverage
- the percentage of samples that have values for this feature in the data -> the fewer missing values, the higher the coverage
- the distribution of feature values
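Feature coverage is just the per-column fraction of non-missing values; a pandas sketch (the toy frame is an assumption):

```python
import pandas as pd

df = pd.DataFrame({
    "age": [34, 29, None, 41, None, 52],
    "income": [50_000, None, None, None, 72_000, 61_000],
})

# fraction of rows with a value, per feature
coverage = df.notna().mean()  # age: 4/6, income: 3/6
```

A feature with low coverage in training data, or whose coverage differs sharply between training and serving, is a candidate for removal or imputation.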
Model Selection
Always consider:
- Interpretability
- Complexity
- Generalization ability
- Operational constraints (latency, cost, serving environment)